Assessing the potential role of ChatGPT in spine surgery research

Abstract Purpose Since its release in November 2022, Chat Generative Pre-Trained Transformer 3.5 (ChatGPT), a complex machine learning model, has garnered more than 100 million users worldwide. The aim of this study is to determine how well ChatGPT can generate novel systematic review ideas on topics within spine surgery. Methods ChatGPT was instructed to give ten novel systematic review ideas for five popular topics in spine surgery literature: microdiscectomy, laminectomy, spinal fusion, kyphoplasty and disc replacement. A comprehensive literature search was conducted in PubMed, CINAHL, EMBASE and Cochrane. The number of nonsystematic review articles and number of systematic review papers that had been published on each ChatGPT-generated idea were recorded. Results Overall, ChatGPT had a 68% accuracy rate in creating novel systematic review ideas. More specifically, the accuracy rates were 80%, 80%, 40%, 70% and 70% for microdiscectomy, laminectomy, spinal fusion, kyphoplasty and disc replacement, respectively. However, there was a 32% rate of ChatGPT generating ideas for which there were 0 nonsystematic review articles published. There was a 71.4%, 50%, 33.3%, 50%, 62.5% and 51.2% success rate of generating novel systematic review ideas, for which there were also nonsystematic reviews published, for microdiscectomy, laminectomy, spinal fusion, kyphoplasty, disc replacement and overall, respectively. Conclusions ChatGPT generated novel systematic review ideas at an overall rate of 68%. ChatGPT can help identify knowledge gaps in spine research that warrant further investigation, when used under supervision of an experienced spine specialist. This technology can be erroneous and lacks intrinsic logic, so it should never be used in isolation. Level of Evidence Not applicable.


BACKGROUND
Artificial intelligence (AI) has emerged as a promising tool in the field of medicine due to its ability to analyze large datasets, identify patterns and generate predictions with clinical relevance. The application of AI has been previously described in the field of spine surgery [5,9,18,21,29,35,39]. For instance, unique algorithms can generate predictions on outcomes of conservative versus surgical treatment in patients with spine pathology [21]. AI refers to the study of algorithms that provide machines with reasoning and cognitive abilities, such as decision-making and problem-solving [24]. Machine learning refers to the ability of machines to learn through pattern recognition; it is particularly suited to predictions based on existing data [10]. Machine learning has numerous applications in medicine, such as prediction of surgical site infections by creating models that encompass diagnoses, treatments and laboratory values [44]. A study conducted in 2017 found that machine learning could predict patient lung cancer staging in the Surveillance, Epidemiology and End Results Cancer Registry with high sensitivity, specificity and accuracy [6].
On November 30, 2022, Chat Generative Pretrained Transformer 3.5 (ChatGPT), created by OpenAI, a San Francisco-based AI laboratory, became available to the general public [4]. Since its release, ChatGPT has garnered attention for its impressive performance on all three components of the United States Medical Licensing Exam and its ability to deceive scientists with its abstract-writing skills [15,30]. Studies in the field of plastic surgery have explored the potential utility of ChatGPT in research generation, grant writing and patient consultations [20,38,49]. A study by He et al. described the possible roles of ChatGPT in spinal surgical practice, such as surgical planning, patient data collection and postoperative rehabilitation guidance [25]. ChatGPT is trained on more than 1.6 billion parameters, allowing it to interact with users, appropriately respond to questions, learn from and acknowledge previous mistakes and manipulate data [7,41].
Research is a crucial component of the growing field of spine surgery, and thus, systematic reviews offer considerable utility. Systematic reviews are investigations that consolidate and analyze relevant findings of all available studies concerning a specific research question. Systematic reviews coincide with the practice of evidence-based medicine: treatment of patients through the integration of personal clinical experience with findings from intensive, high-standard research [13,19]. Physicians utilize systematic reviews to learn about updates in the field and appropriately adjust their clinical practices. Systematic reviews can also serve as justification for further research from the perspective of granting agencies, clinical practice guideline developers and healthcare agencies [19]. As such, subject areas within the field of spine surgery that lack systematic reviews represent potential knowledge gaps. However, systematic reviews can be costly, time-consuming and resource-intensive, requiring an average of 6 months to several years to complete [47]. Furthermore, it can be difficult for a researcher alone to ensure that a systematic review idea is both novel and clinically relevant. Machine learning might have a role in ameliorating these practical concerns for spine surgeons conducting research. Therefore, the purpose of this study is to investigate the ability of ChatGPT to provide novel systematic review proposals on topics within spine surgery.

METHODS
On 4 February 2023, ChatGPT-3.5 was instructed to generate ten novel systematic review proposals for five popular topics in spine surgery literature: microdiscectomy, laminectomy, spinal fusion, kyphoplasty and disc replacement. This was accomplished using the instruction, 'Give 10 novel systematic review ideas on [topic]' for each of the five aforementioned topics. Therefore, a total of 50 systematic review ideas were generated for this study. A new ChatGPT account was utilized in order to eliminate user history bias. Furthermore, prompts for each of the five topics were given in separate chats.
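As a minimal sketch, the prompting scheme described above could be expressed as follows. The topic names and prompt wording are taken from the text; the helper name `build_prompts` and the constants are hypothetical, and no call to ChatGPT is made here.

```python
# Topic names and prompt wording come from the Methods; helper names are hypothetical.
TOPICS = [
    "microdiscectomy",
    "laminectomy",
    "spinal fusion",
    "kyphoplasty",
    "disc replacement",
]
IDEAS_PER_TOPIC = 10
PROMPT_TEMPLATE = "Give {n} novel systematic review ideas on {topic}"


def build_prompts(topics, ideas_per_topic):
    """Return one prompt per topic, each intended for its own separate chat."""
    return {
        topic: PROMPT_TEMPLATE.format(n=ideas_per_topic, topic=topic)
        for topic in topics
    }


prompts = build_prompts(TOPICS, IDEAS_PER_TOPIC)
total_ideas = len(prompts) * IDEAS_PER_TOPIC

for prompt in prompts.values():
    print(prompt)
print("Total ideas requested:", total_ideas)
```

Keeping the wording in a single template is one way to hold prompt phrasing constant across topics.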
For each of the 50 ChatGPT-generated novel systematic review ideas, a comprehensive literature search was conducted. PubMed, CINAHL, EMBASE and Cochrane were separately queried for articles that covered the respective ChatGPT-generated topic. Duplicates were removed, and all search results were assessed to verify their relevance to the ChatGPT-generated idea. The number of nonsystematic review articles and number of systematic review papers that had been published in 2023 or prior were recorded and totalled for each generated idea. An idea was considered novel if there were no systematic reviews already published on the topic. Papers were reviewed using the full-text article when available. If there were no published nonsystematic review articles on an idea, this was also noted. The title, lead author and date of publication were recorded for each published systematic review. If an article was published after September 2021, this was recorded. To decrease the possibility of erroneously including or removing studies, two reviewers (I.H., D.M.) separately assessed each article. If there were any discrepancies, a third reviewer (M.V.) was consulted to make the final decision. The overall accuracy rate for novelty was calculated by dividing the number of novel ChatGPT-generated ideas by 50. Topic-specific accuracy rates for novelty were calculated in a similar fashion for microdiscectomy, laminectomy, spinal fusion, kyphoplasty and disc replacement by dividing the number of novel ChatGPT-generated ideas by 10.
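The novelty rule, reviewer adjudication and accuracy calculation described above can be sketched as follows. This is an illustration rather than the authors' actual workflow: the function names are hypothetical, and the per-topic novel counts are back-calculated from the topic-specific rates reported in the Results (for example, 80% of 10 ideas gives 8).

```python
def is_novel(n_systematic_reviews: int) -> bool:
    """An idea is novel when no systematic review has been published on it."""
    return n_systematic_reviews == 0


def adjudicate(first: bool, second: bool, third: bool) -> bool:
    """Use the shared decision of two reviewers; the third decides on disagreement."""
    return first if first == second else third


def accuracy_rate(n_novel: int, n_ideas: int) -> float:
    """Percentage of generated ideas that were novel."""
    return 100.0 * n_novel / n_ideas


# Assumption: counts back-calculated from the reported rates, 10 ideas per topic.
NOVEL_COUNTS = {
    "microdiscectomy": 8,
    "laminectomy": 8,
    "spinal fusion": 4,
    "kyphoplasty": 7,
    "disc replacement": 7,
}
IDEAS_PER_TOPIC = 10

topic_rates = {
    topic: accuracy_rate(n_novel, IDEAS_PER_TOPIC)
    for topic, n_novel in NOVEL_COUNTS.items()
}
overall_rate = accuracy_rate(
    sum(NOVEL_COUNTS.values()), IDEAS_PER_TOPIC * len(NOVEL_COUNTS)
)

print(topic_rates)
print(overall_rate)
```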

RESULTS
Overall, we determined that ChatGPT was able to generate 34 novel systematic review proposals, corresponding to a 68% accuracy rate. Tables 1-5 present each ChatGPT-generated idea with the corresponding number of nonsystematic reviews (non-SRs), number of SRs and novelty status for microdiscectomy, laminectomy, spinal fusion, kyphoplasty and disc replacement, respectively, as identified by traditional manual literature search methods.
Topics related to microdiscectomy and laminectomy had the highest accuracy rates (80%). Spinal fusion had the lowest accuracy rate, calculated to be 40%. Kyphoplasty and disc replacement both had accuracy rates of 70%. In addition, there was an overall 32% rate of ChatGPT generating ideas for which there were 0 nonsystematic review articles published. Taking this into account, there was a 71.4%, 50%, 33.3%, 50%, 62.5% and 51.2% success rate of generating novel systematic review ideas, for which there were also nonsystematic reviews published, for microdiscectomy, laminectomy, spinal fusion, kyphoplasty, disc replacement and overall, respectively.
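The adjusted success rate can be illustrated with a short sketch for a few topics. It rests on an assumption the text does not state: every idea with zero nonsystematic reviews also has zero systematic reviews and therefore counts as novel. The function name is hypothetical. The counts of ideas without nonsystematic reviews (3, 1 and 2) are derived from the per-topic percentages given in the Discussion, and the novel counts from the accuracy rates above.

```python
def adjusted_success_rate(n_ideas: int, n_novel: int, n_without_non_srs: int) -> float:
    """Novelty rate restricted to ideas that have at least one non-SR article.

    Assumption: every idea without non-SR articles is itself novel, so it is
    removed from both the numerator and the denominator.
    """
    eligible = n_ideas - n_without_non_srs
    novel_eligible = n_novel - n_without_non_srs
    return 100.0 * novel_eligible / eligible


# (ideas, novel ideas, ideas with zero non-SRs); derived values, see lead-in.
examples = {
    "microdiscectomy": (10, 8, 3),
    "spinal fusion": (10, 4, 1),
    "disc replacement": (10, 7, 2),
}

for topic, counts in examples.items():
    print(topic, round(adjusted_success_rate(*counts), 1))
```

Under this assumption the three topics come out at 5/7, 3/9 and 5/8, which agree with the reported 71.4%, 33.3% and 62.5%.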

DISCUSSION
This study demonstrates that ChatGPT is a feasible tool for aiding spine surgeons in the identification of knowledge gaps and construction of novel systematic review ideas. Previous studies have shown that clinician-machine interaction can enhance decision-making [24]. Machine learning technologies, such as ChatGPT, can serve as powerful tools for uncovering subtle or overlooked patterns in available data. ChatGPT can swiftly browse through large amounts of data, such as patient records, scientific papers and textbooks, to create novel hypotheses in medicine [14].
During the manual novelty analysis for the 50 ChatGPT-generated ideas, 78 individual systematic reviews that were already published on those topics were identified (Appendix A). One notable limitation of ChatGPT is that it was trained on a data set last updated in September 2021 [45]. This was evident in our study, as some manually identified systematic reviews were published after this date. Specifically, eight systematic reviews were published after 2021 and were most likely not incorporated into ChatGPT training [8,16,17,22,32,33,42,46]. Therefore, ChatGPT cannot provide users with information more recent than its last update. Second, 11 published systematic review titles and abstracts did not include the phrases 'systematic review', 'Cochrane review' or 'meta-analysis', which could have led to ChatGPT missing these articles [11,12,23,26-28,31,36,37,40,43]. Similarly, ChatGPT appeared to miss some articles that did not directly include the procedure name in the title. For instance, during the microdiscectomy portion of the study, two titles with the phrase 'operative approaches for lumbar disc herniation', which did not include 'microdiscectomy', failed to be acknowledged by ChatGPT [3,48]. In these instances, the failure of ChatGPT to recognize these studies may be due to its lack of access to the full text, since such access may require an individual or institutional subscription. Finally, ChatGPT might have had issues recognizing synonyms for medical and procedural terminology or may not have yet 'learned' how certain terms are grouped into larger categories. For example, ChatGPT proposed the systematic review 'The role of biologics in promoting spinal fusion success' as novel. Manual search, however, identified multiple systematic reviews investigating the effectiveness of bone morphogenetic proteins, autologous stem cells and osteoinductive bone graft substitutes [1,2,34,50]. Upon review, these studies did not have the term 'biologics' in their publicly available title, abstract, MeSH terms or keywords, so ChatGPT may not have been able to identify them as fitting into this larger category.
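The terminology problem can be illustrated with a small sketch. It does not describe how ChatGPT works internally; it only shows how literal phrase matching would miss the kinds of articles discussed above. The function names, the umbrella-term map and the example title are hypothetical, with the map entries taken from the examples named in the text.

```python
# Phrases the text reports as absent from 11 published titles and abstracts.
SR_PHRASES = ("systematic review", "cochrane review", "meta-analysis")

# Hypothetical umbrella-term map; the members are the examples named in the text.
UMBRELLA_TERMS = {
    "biologics": (
        "bone morphogenetic protein",
        "autologous stem cell",
        "osteoinductive bone graft substitute",
    ),
}


def labelled_as_sr(text: str) -> bool:
    """True when the text explicitly names itself a review or meta-analysis."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in SR_PHRASES)


def literal_match(text: str, term: str) -> bool:
    """True only when the exact term appears in the text."""
    return term.lower() in text.lower()


def expanded_match(text: str, term: str) -> bool:
    """True when the term or any member of its umbrella category appears."""
    lowered = text.lower()
    candidates = (term.lower(),) + UMBRELLA_TERMS.get(term.lower(), ())
    return any(candidate in lowered for candidate in candidates)


# Hypothetical title, invented for illustration only.
title = "Bone morphogenetic protein use in lumbar spinal fusion: a meta-analysis"

print(literal_match(title, "biologics"))   # False
print(expanded_match(title, "biologics"))  # True
print(labelled_as_sr(title))               # True
print(labelled_as_sr("Operative approaches for lumbar disc herniation"))  # False
```

A manual search by a specialist effectively applies the expanded match, which is one reason supervised use is recommended.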
There were 16 instances of ChatGPT generating a systematic review idea for which there were no nonsystematic reviews published. This occurrence was most notable for laminectomy (50%), followed by kyphoplasty (40%), microdiscectomy (30%), disc replacement (20%) and spinal fusion (10%). To account for this discrepancy, this study also calculated the overall accuracy rate for which there were nonsystematic reviews published on that topic. In these cases, the absence of nonsystematic reviews was generally indicative of a paucity of clinical studies on the proposed topic. Given this, these 16 systematic review proposals could not currently be performed as actual studies, but the topics generated may shed light on research gaps in spine surgery. For instance, these ChatGPT-generated ideas could inspire other types of studies within spine surgery, such as cohort studies, randomized controlled trials and so on. The potential role of ChatGPT in generating salient, novel ideas for other study types should be explored in future studies.
ChatGPT has been observed to sometimes provide answers that appear plausible at first glance but are actually meaningless or illogical. To a degree, this is inevitable, as the current ChatGPT model lacks common sense and practical knowledge. For example, in the present study, it is unclear why ChatGPT missed several relevant systematic reviews for kyphoplasty and disc replacement, since the titles of these articles closely resembled the respective ChatGPT-generated idea. Despite these shortcomings, ChatGPT can still facilitate improvement in spine surgery research, which, in turn, serves to improve patient management and outcomes through the practice of evidence-based medicine. Therefore, we recommend that ChatGPT should never be used blindly but utilized in conjunction with the expertise of a spine surgeon to identify and refine research ideas. We also do not recommend utilizing ChatGPT or other large language models for authorship of research manuscripts, as this would compromise scholarly integrity. This was a pilot study that aimed to determine whether ChatGPT could generate suggestions for both systematic reviews and nonsystematic reviews in a novel fashion. The rationale was that if there were nonsystematic reviews on a topic, it would suggest that there was potential feasibility for designing a systematic review on that topic. For the topics generated that had no systematic reviews or nonsystematic reviews, it was believed these would be topics that merit individual attention by a research team to determine their feasibility. Narrowing the eligibility criteria further may have discarded ideas with potential due to nuances of wording in the title or content of the text at the level of an individual study. However, this is certainly an important point, and determining if ChatGPT can independently establish feasibility should be evaluated in future studies. This study found several topic ideas to be meaningless or illogical, further demonstrating that feasibility warrants evaluation. Tables 1-5 list each ChatGPT topic suggestion verbatim so that readers can adequately comprehend the nature of these ideas.
This study was intended as proof of concept and should be expanded in future studies. First, this study only instructed ChatGPT to generate 50 ideas; increasing the number of generated ideas might affect the accuracy rate. Investigating the effects of specificity in regard to topics might also improve the accuracy rate, as microdiscectomy, laminectomy, spinal fusion, kyphoplasty and disc replacement are relatively broad terms within the field. This study did not fully take advantage of the supervised and reinforcement learning nature of ChatGPT. Future studies should experiment with this, as this is expected to increase accuracy rates and general convenience. A limitation of this study is that the phrasing of prompts given to ChatGPT could have an impact on the results; this study does not investigate whether different prompts significantly alter the generated response. Prompt phrasing was kept the same for each of the five topics: microdiscectomy, laminectomy, spinal fusion, kyphoplasty and disc replacement. To increase standardization between LLM studies, the phrasing for prompts in this study was modelled from similar literature [20]. The purpose of this preliminary study was simply to assess ChatGPT's ability to generate novel systematic review ideas without additional guidance or corrections.
Due to the complexity of surgical interventions, technologies and approaches utilized to manage spine-related disorders, systematic reviews play an important role in the field of spine surgery. Systematic reviews can bridge the gaps in knowledge of a plethora of spine topics, but the sheer volume of data can be overwhelming for researchers to manually identify and analyze. Machine learning technologies, such as ChatGPT, may help address this problem by allowing researchers to analyze large volumes of published literature more quickly while maintaining acceptable accuracy.

CONCLUSION
ChatGPT, an AI chatbot, successfully generated novel systematic review ideas related to spine surgery at an overall rate of 68%. As this technology improves through supervised learning, reinforcement learning and dataset expansion, there is a role for ChatGPT as a research tool. The application of ChatGPT can identify knowledge gaps in spine research that warrant further investigation and improve research efficiency by eliminating the need to manually examine hundreds to thousands of publications. With more specific instructions and user-driven modulation, the accuracy of ChatGPT in this role can be increased.

TABLE 1 ChatGPT-generated systematic review ideas and respective accuracy rates for microdiscectomy: number of nonsystematic reviews (non-SRs), number of SRs and novelty status shown.
[...] of the incidence of recurrent herniation after microdiscectomy. Non-SRs: 5
The impact of patient factors such as obesity and smoking on microdiscectomy outcomes.
[...] of the effectiveness of different surgical approaches for lumbar microdiscectomy.
The impact of preoperative imaging studies on the success of microdiscectomy. Non-SRs: 0
[...] systematic review of the success rates and complication rates of minimally invasive microdiscectomy compared to traditional open microdiscectomy. Non-SRs: 1
The influence of patient expectations on outcomes after microdiscectomy. Non-SRs: 0
[...] systematic review of the cost-effectiveness of microdiscectomy compared to alternative treatments for herniated lumbar discs. Non-SRs: 3
The role of physical therapy and rehabilitation in improving outcomes after microdiscectomy.
[...] review of patient-reported outcomes after microdiscectomy, including quality of life and return to work. Non-SRs: 6

TABLE 2 ChatGPT-generated systematic review ideas and respective accuracy rates for laminectomy: number of nonsystematic reviews (non-SRs), number of SRs and novelty status shown.
[...] between minimally invasive and traditional open laminectomy for spinal stenosis. Non-SRs: 8
Analysis of the effect of patient age and comorbidities on outcomes following laminectomy for spinal stenosis. Non-SRs: 0
Assessment of the long-term outcomes of laminectomy for degenerative disc disease. Non-SRs: 0
Review of the use of nerve monitoring techniques during laminectomy to minimize the risk of nerve injury. Non-SRs: 0
Analysis of the role of postoperative physical therapy in optimizing recovery following laminectomy.
Comparison of laminectomy with other surgical options for the treatment of herniated discs.
Assessment of the impact of different surgical approaches on patient satisfaction after laminectomy. Non-SRs: 0
Analysis of the risk of reoperation following laminectomy for spinal stenosis. Non-SRs: 0
Comparison of outcomes between laminectomy and laminoplasty for the treatment of spinal stenosis. Non-SRs: 3
Examination of the impact of adjuvant treatments such as epidural injections and nerve blocks on outcomes following laminectomy. Non-SRs: 0

TABLE 3 ChatGPT-generated systematic review ideas and respective accuracy rates for spinal fusion: number of nonsystematic reviews (non-SRs), number of SRs and novelty status shown.
[...] term outcomes of cervical spinal fusion for the treatment of degenerative disc disease. Non-SRs: >100; SRs: 0; Novel: Y
The impact of smoking on spinal fusion outcomes. Non-SRs: >100; SRs: 2; Novel: N
The comparative efficacy of minimally invasive vs traditional spinal fusion for lumbar degenerative conditions. Non-SRs: 3; SRs: 0; Novel: Y
The safety and effectiveness of allograft vs autograft for spinal fusion. Non-SRs: 5; SRs: 2; Novel: N
The role of biologics in promoting spinal fusion success. Non-SRs: >100; SRs: 12; Novel: N
The impact of patient characteristics (age, obesity, comorbidities) on spinal fusion outcomes. Non-SRs: >100; SRs: 5; Novel: N
The effectiveness of spinal fusion in the treatment of spinal tumours. Non-SRs: 0; SRs: 0; Novel: Y
The impact of revision surgery on the success of spinal fusion. Non-SRs: >100; SRs: 0; Novel: Y
The cost-effectiveness of spinal fusion compared to non-surgical treatments for chronic back pain. Non-SRs: 10; SRs: 1; Novel: N
The effectiveness of different spinal fusion techniques for the treatment of lumbar spinal stenosis. Non-SRs: >100; SRs: 4; Novel: N

TABLE 4 ChatGPT-generated systematic review ideas and respective accuracy rates for kyphoplasty: number of nonsystematic reviews (non-SRs), number of SRs and novelty status shown.
[...] of kyphoplasty versus alternative surgical and non-surgical treatments for vertebral fractures. Non-SRs: 8
Long-term safety and effectiveness of kyphoplasty in elderly patients.
[...] outcomes of single- vs multi-level kyphoplasty. Non-SRs: 0
Adjunctive therapies for improvement of kyphoplasty outcomes, such as physical therapy or pharmacotherapy. Non-SRs: 0
Kyphoplasty for vertebral compression fractures in patients with spinal tumours. Non-SRs: >100
Kyphoplasty in the management of vertebral fractures in patients with Ankylosing Spondylitis.

TABLE 5 ChatGPT-generated systematic review ideas and respective accuracy rates for disc replacement: number of nonsystematic reviews (non-SRs), number of SRs and novelty status shown.
[...] total disc replacement in the cervical spine. Non-SRs: >100
[...] comparison of total disc replacement with fusion surgery in the lumbar spine.
[...] of the radiological outcomes of total disc replacement surgery. Non-SRs: >100
The impact of age, gender, and body mass index on the outcomes of total disc replacement surgery. Non-SRs: 4
Clinical outcomes of total disc replacement in patients with degenerative disc disease. Non-SRs: >100
Total disc replacement for the treatment of herniated nucleus pulposus and radiculopathy.
The short-term efficacy and safety of artificial total disc replacement for selected patients with lumbar degenerative disc disease compared with anterior lumbar interbody fusion: A systematic review and meta-analysis
The safety and efficacy of hybrid surgery for multilevel cervical degenerative disc disease versus anterior cervical discectomy and fusion or cervical disc arthroplasty: a systematic review and meta-analysis. Hollyer, February 2020
Efficacy and Safety of Total Disc Replacement Compared with Anterior Cervical Discectomy and Fusion in the Treatment of Cervical Disease: A Meta-analysis